Conditional Sum-Product Networks: Imposing Structure on Deep Probabilistic Architectures
Probabilistic graphical models are a central tool in AI; however, they are
generally not as expressive as deep neural models, and inference is notoriously
hard and slow. In contrast, deep probabilistic models such as sum-product
networks (SPNs) capture joint distributions in a tractable fashion, but still
lack the expressive power of intractable models based on deep neural networks.
Therefore, we introduce conditional SPNs (CSPNs), conditional density
estimators for multivariate and potentially hybrid domains which allow
harnessing the expressive power of neural networks while still maintaining
tractability guarantees. One way to implement CSPNs is to use an existing SPN
structure and condition its parameters on the input, e.g., via a deep neural
network. This approach, however, might misrepresent the conditional
independence structure present in data. Consequently, we also develop a
structure-learning approach that derives both the structure and parameters of
CSPNs from data. Our experimental evidence demonstrates that CSPNs are
competitive with other probabilistic models and yield superior performance on
multilabel image classification compared to mean field and mixture density
networks. Furthermore, they can successfully be employed as building blocks for
structured probabilistic models, such as autoregressive image models.
Comment: 13 pages, 6 figures
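The parameter-conditioning variant mentioned in the abstract can be pictured with a toy sketch. Everything below is invented for illustration (layer sizes, weight names, the two-Gaussian "SPN"), not the authors' implementation: a small network maps the input x to the sum-node weights and Gaussian leaf parameters of a fixed two-leaf mixture over a scalar target y, so the conditional density stays tractable while the parameters are arbitrary functions of x.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical conditioning network: 3-dim input -> 6 SPN parameters
# (2 mixture weights, 2 leaf means, 2 leaf log-stds).
W1 = rng.normal(size=(4, 3)) * 0.1
W2 = rng.normal(size=(6, 4)) * 0.1

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cspn_log_density(x, y):
    h = np.tanh(W1 @ x)          # conditioning network
    p = W2 @ h                   # SPN parameters as a function of x
    w = softmax(p[:2])           # sum-node weights (sum to 1)
    mu, log_sd = p[2:4], p[4:6]  # Gaussian leaf parameters
    log_leaf = (-0.5 * ((y - mu) / np.exp(log_sd)) ** 2
                - log_sd - 0.5 * np.log(2 * np.pi))
    # Sum node: log of the weighted mixture, computed stably
    m = log_leaf.max()
    return m + np.log(np.sum(w * np.exp(log_leaf - m)))

x = np.array([0.5, -1.0, 2.0])
print(cspn_log_density(x, 0.0))  # a finite conditional log-density p(y|x)
```

Because the sum-node weights are normalized for every x, the sketch yields a properly normalized conditional density; a real CSPN generalizes this to deep structures over many (possibly hybrid) target variables.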
Elements of Unsupervised Scene Understanding: Objectives, Structures, and Modalities
Enabling robust interactions between automated systems and the real world is a major goal of artificial intelligence. A key ingredient towards this goal is scene understanding: the ability to process visual imagery into a concise representation of the depicted scene, including the identity, position, and geometry of objects. While supervised deep learning approaches have proven effective at processing visual inputs, the cost of supplying human annotations for training quickly becomes infeasible as the diversity of the inputs and the required level of detail increases, putting full real-world scene understanding out of reach.
For this reason, this thesis investigates unsupervised methods to scene understanding. In particular, we utilize generative models with structured latent variables to facilitate the learning of object-based representations. We start our investigation in an autoencoding setting, where we highlight the capability of such systems to identify objects without human supervision, as well as the advantages of integrating tractable components within them. At the same time, we identify some limitations of this setting, which prevent success in more visually complex environments. Based on this, we then turn to video data, where we leverage the prediction of dynamics to both regularize the representation learning task and to enable applications to reinforcement learning. Finally, to take another step towards a real world setting, we investigate the learning of representations encoding 3D geometry. We discuss various methods to encode and learn about 3D scene structure, and present a model which simultaneously infers the geometry of a given scene, and segments it into objects.
We conclude by discussing future challenges and lessons learned. In particular, we touch on the challenge of modelling uncertainty when inferring 3D geometry, the tradeoffs between various data sources, and the cost of including model structure
On Report Cards, Ranking, and Promotion: A Conference Proposal
by K. Stelzner. Progr.-Nr. 60
Faster attend-infer-repeat with tractable probabilistic models
The recent Attend-Infer-Repeat (AIR) framework marks a milestone in structured probabilistic modeling, as it tackles the challenging problem of unsupervised scene understanding via Bayesian inference. AIR expresses the composition of visual scenes from individual objects, and uses variational autoencoders to model the appearance of those objects. However, inference in the overall model is highly intractable, which hampers its learning speed and makes it prone to suboptimal solutions. In this paper, we show that the speed and robustness of learning in AIR can be considerably improved by replacing the intractable object representations with tractable probabilistic models. In particular, we opt for sum-product networks (SPNs), expressive deep probabilistic models with a rich set of tractable inference routines. The resulting model, called SuPAIR, learns an order of magnitude faster than AIR, treats object occlusions in a consistent manner, and allows for the inclusion of a background noise model, improving the robustness of Bayesian scene understanding.
Structured Object-Aware Physics Prediction for Video Modeling and Planning
When humans observe a physical system, they can easily locate objects,
understand their interactions, and anticipate future behavior, even in settings
with complicated and previously unseen interactions. For computers, however,
learning such models from videos in an unsupervised fashion is an unsolved
research problem. In this paper, we present STOVE, a novel state-space model
for videos, which explicitly reasons about objects and their positions,
velocities, and interactions. It is constructed by combining an image model and
a dynamics model in a compositional manner and improves on previous work by
reusing the dynamics model for inference, accelerating and regularizing
training. STOVE predicts videos with convincing physical behavior over hundreds
of timesteps, outperforms previous unsupervised models, and even approaches the
performance of supervised baselines. We further demonstrate the strength of our
model as a simulator for sample efficient model-based control in a task with
heavily interacting objects.
Comment: Published as a conference paper at the 2020 International Conference on Learning Representations (ICLR)
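The core idea of an explicit object state-space (positions and velocities, unrolled by a dynamics model) can be caricatured in a few lines. The pairwise-repulsion rule and all names below are invented stand-ins for STOVE's learned dynamics network; the point is only the shape of the rollout, not the model itself.

```python
import numpy as np

def dynamics_step(pos, vel, dt=0.1, strength=0.05):
    """One state-space transition: a hand-written pairwise repulsion
    stands in for a learned interaction network."""
    diff = pos[:, None, :] - pos[None, :, :]        # (N, N, 2) displacements
    dist2 = (diff ** 2).sum(-1) + np.eye(len(pos))  # eye avoids 0/0 on diagonal
    force = (diff / dist2[..., None]).sum(axis=1) * strength
    vel = vel + dt * force                          # update velocities...
    return pos + dt * vel, vel                      # ...then positions

# Three objects, initially at rest; unroll the dynamics for 100 steps.
pos = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
vel = np.zeros_like(pos)
for _ in range(100):
    pos, vel = dynamics_step(pos, vel)
print(pos)  # objects have drifted apart under the repulsive interaction
```

In STOVE proper, the transition above is a learned neural model, the observation model is an SPN-based image model, and the same dynamics network is reused during inference to regularize and accelerate training.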